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I Abstract 

Sparsity in the eigenvectors of signal covariance matrices is exploited in this paper for compression and 
denoising. Dimensionality reduction (DR) and quantization modules present in many practical compression 
schemes such as transform codecs, are designed to capitalize on this form of sparsity and achieve improved 
reconstruction performance compared to existing sparsity-agnostic codecs. Using training data that may 
be noisy a novel sparsity-aware linear DR scheme is developed to fully exploit sparsity in the covariance 
eigenvectors and form noise-resilient estimates of the principal covariance eigenbasis. Sparsity is effected 
via norm-one regularization, and the associated minimization problems are solved using computationally 
efficient coordinate descent iterations. The resulting eigenspace estimator is shown capable of identifying a 
subset of the unknown support of the eigenspace basis vectors even when the observation noise covariance 
matrix is unknown, as long as the noise power is sufficiently low. It is proved that the sparsity-aware 
estimator is asymptotically normal, and the probability to correctly identify the signal subspace basis 
support approaches one, as the number of training data grows large. Simulations using synthetic data 
and images, corroborate that the proposed algorithms achieve improved reconstruction quality relative to 
alternatives. 
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I. Introduction 

Data compression has well-appreciated impact in audio, image and video processing since the increasing 
data rates cannot be matched by the computational and storage capabilities of existing processors. The 
cornerstone modules of compression are those performing dimensionality reduction (DR) and quantization, 
as in e.g., transform codecs ifToll . DR projects the data onto a space of lower dimension while minimizing 
an appropriate figure of merit quantifying information loss. Quantization amounts to digitizing the analog- 
amplitude DR data. Typically, DR relies on training vectors to find parsimonious data representations 
with reduced redundancy without inducing, e.g., mean-square error (MSE) distortion in the reconstruction 
process. One such property that promotes parsimony is sparsity. 

Sparsity is an attribute characterizing many natural and man-made signals (2), and has been successfully 
exploited in statistical inference tasks using the least-absolute shrinkage and selection operator (Lasso) 
|29l , l35l . In parallel, recent results in compressive sampling rely on sparsity to solve under-determined 
systems of linear equations, as well as sample continuous signals at sub-Nyquist rates Q. These tasks 
take advantage of sparsity present in deterministic signal descriptors. But what if sparsity is present in 
statistical descriptors such as the signal covariance matrix? The latter is instrumental in compression when 
the distortion metric is MSE. In bio-informatics and imaging applications the data input to the DR module 
have covariance matrices whose eigenvectors admit a sparse representation over a certain domain, such as 
the wavelet domain |[T8l . 

The 'workhorse' for DR is principal component analysis (PCA) (51, |[T9ll , which relies on the covariance 
matrix to project the data on the subspace spanned by its principal eigenvectors. So far, sparsity has been 
exploited for interpreting each principal component, but not for reconstructing reduced-dimension signal 
renditions. Specifically, the standard PCA criterion has been regularized with an ^i-norm penalty term 
to induce sparsity as in the Lasso, and perform variable selection of the data entries that significantly 
contribute to each principal component fl20l . However, the nonconvex formulation of f20| does not lend 
itself to efficient optimization. The sparse PCA formulation in [36] leads to a cost minimized using the block 
coordinate descent optimization method O; see also |[23l which includes a weighted £i-norm sparsity - 
imposing penalty. PCA with cardinality constraints on the number of nonzero elements per principal 
eigenvector has also been considered using relaxation techniques (H, |9). Alternative approaches augment 
the standard singular value decomposition (SVD) cost, or the maximum likelihood criterion with l\ (or 
£q) penalties to effect sparsity in the principal eigenvectors |fl8l , 11271 , |[3Tl - |[33l . Sparsity has also been 
exploited to render PCA robust to outliers O, as well as to reduce complexity at the encoder Ifl2l . In all 
the aforementioned schemes sparsity is not exploited for reconstructing signals that have been compressed 
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by DR and quantization. 

When dealing with noisy data, pertinent reconstruction techniques have been developed to perform joint 
denoising and signal reconstruction |[25l . ||26*1 . ll28l . However, existing approaches rely on the noise second- 
order statistics being available. Here, the a priori knowledge that the signal covariance eigenvectors are 
sparse is exploited, and joint denoising-reconstruction schemes are introduced without requiring availability 
of the noise covariance. 

The standard PCA cost is augmented with pertinent l\— and £2— norm regularization terms, that fully 
exploit the sparsity present in the covariance eigenvectors when performing not only feature extraction 
(as in J9), GUI . ll33l . 11361 ) but also reconstruction. The resulting bilinear cost is minimized via coordinate 
descent which leads to an efficient sparse (S-) PCA algorithm for DR and reconstruction.Its large-sample 
performance is analyzed in the presence of observation noise. If the ensemble covariance matrix of the 
noisy training data is known, then the sparsity-aware estimates do better in terms of identifying the support 
of the principal basis vectors when compared to the standard sparsity-agnostic PCA. As the number 
of training data used to design the DR module grows large, the novel sparsity-aware signal covariance 
eigenspace estimators: i) are asymptotically normal; and ii) identify the true principal eigenvectors' support 
(indices of nonzero elements) with probability one. The last two properties, known as 'oracle' properties 
||35l , combined with the noise-resilience enable the novel S-PCA to attain an improved trade-off between 
reconstruction performance and reduced-dimensionality-a performance advantage also corroborated via 
numerical examples. The proposed sparsity-aware DR scheme is finally combined with a vector quantizer 
(VQ) to obtain a sparsity-aware transform coder (SATC), which improves reconstruction performance when 
compared to standard TC schemes that rely on the discrete cosine transform (DCT) or PCA transform. The 
merits of SATC are also demonstrated in the compression and reconstruction/denoising of noisy images 
that have been extracted from [fl]]. 

The rest of the paper is organized as follows. After stating the problem setting in Sec. II, the proposed 
sparsity-aware PCA formulation is introduced in Sec. III-A. Coordinate descent is employed in Sec. III-B to 
minimize the associated bilinear cost, while a computationally simpler element-wise algorithm is derived 
in Sec. III-C. A cross-validation scheme is outlined in Sec. III-D for selecting the sparsity-controlling 
coefficients that weigh the ^i-norm based regularization terms. Asymptotic properties are derived both in 
the noiseless and noisy cases, establishing the potential of the novel estimators to recover the underlying 
signal covariance principal eigenvectors (Sees. IV-A, IV-B). S-PCA is combined with vector quantization 
in Sec. V. Synthetic numerical examples (Sec. VI-A) illustrate the performance advantage of SATC, while 
tests using real noisy images corroborate the potential of SATC in practical settings (Sec. VI-B). 
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II. Preliminaries and Problem Formulation 

Consider a collection of n training data vectors {x 4 = s t + Wf}" =1 , each containing the signal of interest 
s t £ IR pxl , in additive zero-mean possibly colored noise w t , assumed independent of s t . It is also assumed 
that Si lies in a linear subspace of reduced dimension, say r < p, spanned by an unknown orthogonal 
basis {u StP } r p=1 . Many images and audio signals lie on such a low-dimensional subspace. Accordingly, s t 
for such signals can be expressed as 

r 

s t = V s + nt >P Us >P> t = l,...,n (1) 

where fi s denotes the mean of s t , and 7r ttP are zero-mean independent projection coefficients. 

The covariance matrix of the noisy is given by S x = X s + E^, where S s (S w ) denotes the signal 
(noise) covariance matrix. Consider the eigen-decomposition S^, = U„,D W U^, where XJ W denotes the 
eigen-matrix containing the eigenspace basis, and D„, := diag(d 1U) i, . . . d w>p ) the corresponding eigenvalues 
( T denotes matrix transposition). Likewise, for the signal covariance matrix let S s = U S D S U^ = 
U S)r D SiT .Uj r where U Sjr := [u Sj i . . . u s r ] and D s r := diag(d aj i, . . . , d s>r ) contain the r dominant 
eigenvectors and eigenvalues of S s , respectively; while d SyP := E[ir p t ]. Further, let U SiP _ r £ W x (p~ r ) 
denote the matrix formed by the subspace of dimensionality p — r, which is perpendicular to the signal 
subspace U Sjr . In the following, fi s is assumed known and subtracted from s t ; thus, without loss of 
generality (wlog) x t and s t are assumed zero-mean. Matrices S s and H w are not available, which is the 
case in several applications. Moreover, the following is assumed about sample covariances. 
(al) The signal of interest St and observation noise wj are independent across time t and identically 
distributed. Thus, by the strong law of large numbers the sample covariance matrix estimate ^ x ,n := 
n 1 St=i x t x ?" converges almost surely, as n — > oo, to the ensemble covariance matrix Yl x . 

Consider next a unitary transformation matrix F £ M. pxp to form the transformed data 

r 

x i: =Fxi = ^ ^t,pFu s , P + Pw t . (2) 
p=\ 

Such a mapping could represent the wavelet, Fourier, or, the discrete cosine transform (DCT). The case of 
interest here is when F is such that the transformed eigenvectors u s >p := Fu s p of Sg, where s := Fs, have 
many entries equal to zero, i.e., has sparse eigenvectors. One natural question is whether eigenvectors 
of a covariance matrix admit a sparse representation over e.g., the DCT domain. Often in bio-informatics 
and imaging applications the data input to the DR module have covariance matrix with sparse eigenvectors 
|[T8l . The same attribute is also present in other classes of signals. Consider for instance signal vectors 
comprising uncorrected groups of entries, with each group containing correlated entries - a case where 
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the covariance matrix is block diagonal. In addition to block diagonal covariance matrices, the class also 
includes row- and/or column-permuted versions of block diagonal covariance matrices. 

An example is provided next to demonstrate that S| can have sparse eigenvectors. A high-resolution 
image taken from [1]| displaying part of the Martian terrain, was split into 112 smaller non-overlapping 
images of size 180 x 256. Each of these images was further split into 8x8 blocks. Vectors {st}Jl^ 
correspond to a block (here the 10th after lexicographically scanning each of the 112 sub-images) com- 
prising entries with the same row and column indices in all 112 different sub-images. An estimate of 
the underlying covariance matrix of the vectorized blocks is formed using sample averaging, to obtain 
T, s := (112) -1 Yld"=i s t s I- The rank(X! s ) = 15 indicates that the 64 x 1 vectorized image blocks lie 
approximately in a linear subspace of IR 64xl having dimension r = 15. This subspace is spanned by 
the 15 principal eigenvectors of S s and forms the signal subspace. To explore whether the principal 
eigenvectors of S s have a hidden sparsity structure that could be exploited during DR Fig. [TJ (left) depicts 
the value for each of the entries of the second principal eigenvector of S s , namely u s .2, versus the index 
number of each entry. Note that the entries of u Sj 2 exhibit a sinusoidal behavior; thus, if DCT is applied 
to u Sj 2 the resulting vector u S 2 = has only a few DCT coefficients with large magnitude. Indeed, 

Fig. [TJ (right) corroborates that u x .2 is sparse over the DCT domain. In fact, all 15 principal eigenvectors 
of S s admit a sparse representation over the DCT domain as the one displayed in Fig. [TJ (right). Thus, 
the sample covariance matrix of the transformed vectorized blocks s t = Fs t has 15 principal eigenvectors 
that exhibit high degree of sparsity. Such a sparse structure is to be expected since images generated 
from [TJ exhibit localized features (hilly terrain), which further result in sparse signal basis vectors in the 
DCT domain [18]. For simplicity, the original notation x t , s t and w t will henceforth refer to the DCT 
transformed training data, signal of interest, and observation noise, respectively. 

Aiming to compress data vector x, linear DR is performed at the encoder by left-multiplying x with 
a fat matrix C € R <?xp , where q < r < p. The reduced dimension q may be chosen smaller than the 
signal subspace dimension r, when resources are limited. Vector Cx is received at the decoder where it 
is left-multiplied by a tall p x q matrix B to reconstruct s as s := BCx. Matrices B and C should be 
selected such that s = BCx forms a 'good' estimate of s. One pair of matrices B ,C , minimizing the 
reconstruction MSE 

(B , C ) G arg min E[\\s - BCxf] (3) 

B,C 

are given as C D = Uj x H SX J1~ 1 , B D = XJ SXj q, where XJ SX:q are the q principal eigenvectors of 
XsajE" 1 !^, while the £ notation emphasizes that ([3]) does not have a unique optimal solution (e.g., 
see (5j Ch. 10]). In the absence of observation noise, (x t = s t ), © corresponds to the standard PCA, 
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where a possible choice for B Q , C D is B D = U Si9 and C Q = U^ q JU Ch. 9]. Since the ensemble covariance 
matrices are not available; C and B D cannot be found. The practical approach is to replace the cost in 
© with its sample-averaged version n _1 ||S — BCX|||,, where S := [si . . . s n ] and X := [xi . . . x n ]. This 
would require training samples for both x, and the signal of interest s (5] Ch. 10]. This is impossible in 
the noisy setting considered here. The reduced dimension q < p can be selected depending on the desired 
reduction viz reconstruction error afforded. 

In the noiseless case, the optimal DR and reconstruction matrices are formed using the signal eigen- 
vectors, that is C = Bj = U^q. But even in the noisy case, the signal subspace U Si0 is useful for 
joint DR and denoising [25]], [261, EU, O. Indeed, if C = B T = JJ^ q , then it is easy to see that 
J rec (B,C) := E[\\s - BCx||2] = tr(£™) - tr(U^ p „ g S w U s , p _ ) < tr(£™) when q = r < p. Thus, 
projection of x f onto the signal subspace not only achieves DR but also reduces noise effects. The question 
of course is how to form an estimate for V s ,g when S x and are unknown. Existing signal subspace 
estimators assume either that i) the noise is white, namely H w = <7^I p for which U x = U s (I p denotes 
the p x p identity matrix); or, ii) the £„, is known or can be estimated via sample-averaging (25], [261 , 
[28l . [341 . In the setting here these assumptions are not needed. 

Based on training data {x 4 }™ =1 , the major goal is to exploit the sparsity present in the eigenvectors of 
S s in order to derive estimates B, C for the signal subspace U s . q , thereby achieving a better trade-off 
between the reduced-dimension q and the MSE cost J rec (B,C) than existing alternatives (9], EUl . ll33l . 
|36ll . Towards this end, a novel sparsity-aware signal subspace estimator is developed in the next section. 
Since the majority of data processing and communication systems are digital, this sparsity-exploiting DR 
step will be followed by a sparsity-cognizant VQ step in order to develop a sparsity-aware transform 
coding (SATC) scheme for recovering s t based on quantized DR data. 

III. Sparse Principal Component Analysis 

Recall that the standard PCA determines DR and reconstruction matrices C D and B D by minimizing 
the sample-based cost J rec (B,C) := n _1 ||X — BCX|||,. One possible minimizer for the latter is B G = 
C„ = XJ X: q, where U X) q comprises the q dominant eigenvectors of ~E X . In the noiseless case it holds that 
S x = from which it follows that \J x ,g = U s However, the g-dominant eigenvectors of S x do not 
coincide with U<j j(? when the additive noise is colored (£ w ^ c^Ip)- In this case, standard PCA is not 
expected to estimate reliably the signal subspace. 
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A. An i\-regularized formulation 

Here the standard PCA formulation is enhanced by exploiting the sparsity present in U s r . Prompted by 
Lasso-based sparse regression and PCA approaches GUI . |[29l , |[33l , 11361 , the quadratic cost of standard 
PCA is regularized with the i'l-norm of the unknowns to effect sparsity. Specifically, B G = C T Q = U S (? 
could be estimated as 

q p 

argminn^llX-BCXlll + ^^A^IC^j)! + \B(j,p)\), s. to B = C T (4) 

p=l 3=1 

which promotes sparsity in U Sjr . However, the constraint B = C T leads to a nonconvex problem that 
cannot be solved efficiently. This motivates the following 'relaxed' version of (0]), where the wanted 
matrices are obtained as 

q p 

(B,C) e argminn^HX - BCX||^ + V V \ p (\C(p, j)\ + \B(j,p)\) + p\\B - C T \\ 2 F (5) 

B.C — — 

p=l j=l 

using efficient coordinate descent solvers. Note that q < r, since the dimensionality of the signal subspace 
may not be known. Moreover, {A p }^ =1 are nonnegative constants controlling the sparsity of B and C. 
Indeed, the larger A p 's are chosen, the closer the entries of B and C are driven to the origin. Taking into 
account that the 'clairvoyant' compression and reconstruction matrices satisfy B Q = C„ = XJ S j9 , the term 
/i||B — C T |||i ensures that B and C T stay close. Although B G and C„ are orthonormal, B and C are 
not constrained to be orthonormal in ® because orthonormality constraints of the form B T B = I and 
CC T = I are nonconvex. With such constraints present, efficient coordinate descent algorithms converging 
to a stationary point cannot be guaranteed. 

Remark 1: The minimization problem in © resembles the sparse PCA formulation proposed in |[36l . 
However, the approach followed in ll36l imposes sparsity only on C, while it forces matrix B to be 
orthonormal. The latter constraint mitigates scaling issues (otherwise C could be made arbitrarily small 
by counter-scaling B), but is otherwise not necessarily well-motivated. Without effecting sparsity in B, 
the formulation in ll36l does not fully exploit the sparsity present in the eigenvectors of S s , which further 
results in sparse clairvoyant matrices C Q and B Q in the absence of noise. Notwithstanding, ([5]) combines 
the reconstruction error n _1 ||X — BCX|||, with regularization terms that impose sparsity to both B and 
C. Even though the £i-norm of C together with ||B — C T |||, suffice to prevent scaling issues, the ^i-norm 
of B is still needed to ensure sparsity in the entries of B. 

B. Block Coordinate Descent Algorithm 

The minimization problem in ([5]) is nonconvex with respect to (wrt) both B and C. This challenge 
will be bypassed by constructing an iterative solver. Relying on block coordinate descent (see e.g., 1H pg. 
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160]) the cost in ((5]) will be iteratively minimized wrt B (or C), while keeping matrix C (or B) fixed. 
Specifically, given the matrix B r _i at the end of iteration r — 1, an updated estimate of C G at iteration 

r can be formed by solving the minimization problem [cost in ([5]) has been scaled with n] 

i 

C T = argmin ||X - B T _iCX||| + £ MCjjHi + A*||C - K-i\\f (6) 
° p=i 

where Q T p . denotes the pth row of C, while n is absorbed in \i and A p . After straightforward manipulations 
© can be equivalently reformulated as 

g 

C T = argmin tr(X T C T B^ 1 B T _ 1 CX-2X T C T B^_ 1 X)+ V[A p ||C^|| 1 +^||C^||i-2^B T „ 1 , p ] (7) 

p=l 

where B r _i :p corresponds to the pth column of B T _i. After introducing some auxiliary variables j }?=i 
the optimization problem in §7} can be equivalently rewritten as a convex optimization problem that has i) 
a cost given by MX^B^B^ CX) - 2tr(X r C r B^ 1 X) + £« =1 A p £J =1 t p>j + /x £« =1 II C£ 111 - 
2jU^ p=1 Cj.B T _i ::p ; and ii) a constraint set formed by the inequalities {|C(p, j)| < tp^} p ^ lp=1 . This 
constrained minimization problem can be solved using an interior point method (H. 

Given the most recent DR update C r , an updated estimate of the reconstruction matrix B Q is obtained 

as 

B r = argmin ||X - BC r X||^ + V A p ||B :p || 1 + (i\\B - C^|||. (8) 

B 

The minimization problem in ([8]) can be split into the following p subproblems: 

9 

B TJ: = argmin ||X i: - Bj.C T X\\ 2 2 + p\\Bj. - (C^fWl + ^ A p |B(j»| 

P =i 

<? 

= argmin ||[X j: , ft 1 ' 2 (C T , :j ) T } - Bj._[C T X , ^%]\\ 2 2 + ^ X p \B(j,p)\, j = l,...,p (9) 

p=i 

where X J: denotes the jth row of X. 

Notice that (O corresponds to a Lasso problem that can be solved efficiently using e.g., the LARS 
algorithm ifTOl . The proposed block coordinate descent (BCD-) S-PCA algorithm yields iterates B T and 
C r that converge, as r — > oo, at least to a stationary point of the cost in (O- a fact established using the 
results in e.g., ll30l Thm. 4.1] (see also arguments in Apdx. B). The BCD-SPCA scheme is tabulated as 
Algorithm Q] 

C. Efficient SPCA Solver 

Relying on the BCD-SPCA algorithm of the previous section, an element-wise coordinate descent 
algorithm is developed here to numerically solve ([5]) with reduced computational complexity. Specifically, 
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Algorithm 1 : BCD-SPCA 

Initialize Bo = Cq = U s , g , where U s g is the signal subspace estimate found by standard PCA. 
for t = 1,. . . do 

Find C T by solving Q. 

Find B r by solving (O for j = 1, . . . ,p. 

If |Cost T — Cost r _i| < e for a prescribed tolerance e then break 
end for 



the cost in (0) is iteratively minimized wrt an entry of either B or C, while keeping the remaining entries 
fixed. One coordinate descent iteration involves updating all the entries of matrices B and C. 

Given iterates B r _i and C r _i, the next steps describe how the entries of C r and B r are formed. Let 
<g> denote the Kronecker product, and vec(C) the qp x 1 vector obtained after stacking the p columns of 
C. Using the property vec(B r _iCX) = (X T eg) B r _i)vec(C), the cost in ([5]) after setting B = B r _i can 
be re-expressed as 

Q 

||vec(X) - (X T B r _i)vec(C)||^ + £ A„(||C£||i + ||B T _i, : Ji) + ^||B T _i - C T f F . (10) 

P =i 

Next, we show how to form C T (p,j) at iteration r, based on the most up-to-date values of B and 
c v := vec(C), namely B T _i, {c v ,T^i(m)} Q ^ ={j _ 1)q+p+1 and {c 1 , iT (m)}^~J )9+p ~ 1 . It follows from (QUJi 
that C T (p,j) = c V)T ((j — l)q + p), for p = 1, . . . , q and j = 1, . . . ,p, can be determined as 

C T (p,j) = argmm||x rpj - - cX BT _ u . {j _ 1)q+p \\l + p(c - B r _i(j, p)f + X p \c\ (11) 

where X^^ := X r ®B r _i, x T>P> j : = vec ( x )-Em=i k+ ^ 1 c V)T {m)±B T _ 1 - m -YZ={j-i) q + P +i ^.t-iM 
Xg T l . m , and Xs T _ 1 ,:m corresponds to the mth column of Xb t1 . Interestingly, the minimization in (fTTb 
corresponds to a sparse regression (Lasso) problem involving a scalar. The latter admits a closed-form 
solution which is given in the next Lemma (see Apdx. A for the proof). 

Lemma 1 The optimal solution of the minimization problem 

c = arg min \\x — ch||| + p(c — b) 2 + A|c| (12) 

c 

where y and h are column vectors and b is a scalar constant, is given by 
c = sgn ( x r h + pb ) x 



X T h + pb 



2||h 



11 + 2^ 
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Based on Lemma Q] one can readily deduce that the optimal solution of (fTTT ) is given by 

C T (p,j) =Sgn ((Xr,p,j) T ^B T ^,:(j^l) q +p + fJ&T-lU'P)) 
(Xr,p,j) T ^B^ 1 ,:(j~l)g+p + ^B T _i(j» 



2 ll X B T _i,:(j-l)q+pll2 + 



(13) 



ll X B T -i,:C3-l)g+pll2 + V 

where (•)+ := max(-,0). 

Similarly, starting from the minimization problem in (© and applying an element-wise coordinate descent 
approach an update for the B(j, p) can be obtained as 

B T (j,p) = argmin ||</> Tjij0 - 6Xcr r , :p ||| + p(b - C T (p,j)) 2 + \ p \b\, p = l,...,q (14) 

where X Cr := X T C^, := X :i - £f=i B T (j, l)± Cr ,l ~ £? =p+ i B r _i(j,Z)X Cr ,:j, and ± Ct ,i 

denotes the Zth column of Xc t , while X y - refers to the jth column of X. The optimal solution of the 
minimization problem in (fT4l is given by 



BtO'.P) = S § n ( (^r,j,p) TX a,:p + pC T {p,j 



^r,j,p) T ^C T ,:p + liC T (p,j) 



\X C ^:p\\l + fJ, 



211X^,^111 + 2m/ 
(15) 



Note that iteration r involves minimizing © wrt to each entry of C or B while fixing the rest. It is shown 
in Appendix B that the computationally efficient coordinate descent (ECD)-SPCA scheme converges, as 
r — > oo, at least to a stationary point of the cost in ((5]) when the entries of X are finite. The proof 
relies on [ 30 , Thm. 4.1]. Using arguments similar to those in Appendix B, it can be shown that BCD- 
SPCA converges too. A stationary point for the nondifferentiable cost here is defined as one whose lower 
directional derivative is nonnegative toward any possible direction [30, Sections 1 and 3], meaning that the 
cost cannot decrease when moving along any possible direction around and close to a stationary point. A 
strictly positive p ensures that the minimization problems in (fTTT) and (fT4l) are strictly convex with respect 
to either C or B. This guarantees that the minimization problems in (fTTT) or (fT4l have a unique minimum 
per iteration, which in turn implies that the ECD-SPCA algorithm converges to a stationary point. If p = 0, 
the proposed algorithms may not converge. This can also be seen from the updating recursions ( fT3l ) and 
([TBI ). If p = and at a certain iteration one of the matrices is zero, say C r , this may cause the entries of 
B r to diverge. Similar comments apply in BCD-SPCA. 

The optimal solution in (fT3l and (fl3T > can be determined in closed form at computational complexity 
0(np). With 2qp entries in C and B, the total complexity for completing a full coordinate descent iteration 
is Oinqp 2 ). The ECD-SPCA scheme is tabulated as Algorithm |2 

Note that the sparsity coefficient A p is common to both terms ||C^||i and ||B :p ||i. This together with 
the explicit dissimilarity penalty in ([5]) force the estimates obtained via ECD-SPCA (denoted B r and C^) 
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Algorithm 2 : ECD-SPCA 

Initialize Bo = Cq = U s , g , where U s g is the signal subspace estimate found by standard PCA. 
for t = 1,. . . do 

For j = 1, . . . , p and p = 1, . . . ,q determine C r (p, j) using ( fT3l ). 

For j = 1, . . . ,p and p ~ 1, . . . , q determine B T (j, using ( IT5l ). 

If |Cost T — Cost r _i| < e for a prescribed tolerance e then break 
end for 



to be approximately equal for sufficiently large r. The latter equality requirement is further enforced in the 
BCD- (or ECD-) SPCA scheme by selecting [i sufficiently large, e.g., in our setting /i = 100. The ideal B G 
and are orthonormal and equal in the noiseless case; thus, the same properties are imposed to B T and 
Cy . Towards this end, we: i) pick one of the matrices B T and C^, say the latter; ii) extract, using SVD, 
an orthonormal basis U s G MP xq spanning the range space of C T ; and iii) form the compression and 
reconstruction matrices B r = C^ = U s . Thus, the dimensionality of each acquired vector x is reduced 
at the encoder using the linear operator U s , while at the decoder the signal of interest is reconstructed as 
s = U s (Ujx) = (C^C r )"l"C^(C r x), where the symbol f denotes matrix pseudoinverse. 

D. Tuning the sparsity-controlling coefficients 

Up to now the sparsity-controlling coefficients were assumed given. A cross-validation (CV) based 
algorithm is derived in this section to select {X p } q p=1 with the objective to find estimates B and C leading 
to good reconstruction performance, i.e., small J rec (B, C). The CV scheme is developed for the noiseless 
case (x t = s t ). 

Consider the M-fold CV scheme in ifTTl pg. 241-249], for selecting {X p } q p=1 from a g-dimensional grid 
C := {A}, . . . , X{} x ... x {Ag, . . . , Ag}, where {0 < A^ < . . . < \ J p } q p=l denote the candidate values, 
and x denotes Cartesian product. The training data set {x t }™ =1 is divided into M nonoverlapping subsets 
{X m }^ =1 . Let B_. m ({A p }^ =1 ) and C_ m ({A p }^ =1 ) denote the estimates obtained via BCD-SPCA (or 
ECD-SPCA) when using all the training data but those in X m , for fixed values of the sparsity-controlling 
coefficients. The next step is to evaluate an estimate of the reconstruction error using X m , i.e., form the 
reconstruction error estimate «/ r 'S({Ap}p =1 ) := |X m | _1 ||X m — B_ m C_ m X m |||,, where |X m | indicates the 
cardinality of X m . A sample-based estimate of the reconstruction MSE can be found as 

n 

4c({A p }* =1 ) = n" 1 £ J^ t] \{\ P Y P=1 ) ( 16 ) 
t=i 

where m(t) denotes the partition index in which x t is included during the CV process. 
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Using (fT6l ). the desired sparsity-controlling coefficients are selected as 

({A P }^ =1 ) := arg min J IBC ({\ p } q p=1 ). (17) 

The minimization (fTTT ) is carried out using exhaustive search over the grid C. Fig. [2] (right) shows how 
the reconstruction error J rec (-) is affected by the sparsity controlling coefficients. A simplified scenario is 
considered here with {X p = X} q p=1 , with p = 14, and q = r = 2. Matrix ~S X = S s is constructed so that 
80% of the U s entries are zero. The black bars in Fig. [2] (right) quantify the standard error associated 
with each reconstruction MSE estimate in the red CV curve and its amplitude is calculated as 



Std(A) = 



\ 



[12 

t=l 



n 

l)-^[JZ (t) (X)- J rec (A)] 2 . (18) 



When A < 0.1 the reconstruction MSE remains almost constant and equal to the one achieved by standard 
PCA (A = 0). If A > 10 ' 5 , the reconstruction MSE increases and reaches a maximum equal to the trace 
of Y, x (the DR and reconstruction matrices equal zero). The minimum MSE is achieved for A 1. Note 



that Ap's have so far been selected for a fixed value of //. Recall that \i controls the dissimilarity, of C T 
with B, thus a relatively large value (fi = 100 was used in the simulations) suffices to ensure that C and 
B stay close. Of course, the higher is the closer C T and B will be. 

IV. S-PCA Properties 

In this section sufficiently many training data are assumed available (n — > oo), to allow analysis based 
on the ensemble covariance S x . = T, S + 'E W . Recall that (al) ensures a.s. convergence of the sample-based 
cost in © to its ensemble counterpart 

q 

(B e , C e ) E argmintr[(I - BC)S X (I - BC) T ] + V A p (||Cj||i + ||B :p ||i) + H|B " C T \\ 2 F . (19) 

P =i 

Interestingly, it will turn out that even in the presence of colored noise w t , the solution pair B e , C e can 
recover the support of the columns of the signal subspace XJ s , r , or at least part of it, as long as the noise 
power in x t is sufficiently small. 



A. Support recovery in colored noise 

In this section entries of C e (or B e ) will be considered nonzero only if their corresponding magnitudes 
exceed an arbitrarily small threshold 5 > 0. Under this condition it will be demonstrated next that for 
properly selected /i and A p , S-PCA assigns the (non)zero entries in C e consistently with the support of the 
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columns of U s r . This means that the S-PCA formulation is meaningful because it does not assign entries 
of C e (or B e ) arbitrarily, but takes into account where the (non)zero entries of XJ s , r are. Interestingly, this 
will hold even when colored noise is present, as long as its variance is properly upper bounded. 

To proceed, let S s be a permuted version of a block diagonal matrix. Specifically: 
(a2) The entries of s can be partitioned into groups Gi, ■ ■ ■ ,Gk> so that entries with indices in the same 
group are allowed to be correlated but entries with indices in different groups are uncorrelated; i.e., if 
j £ Gk and f £ Gk'> then E[s(j)s(j')] = for k ^ k'. Moreover, let {Gk\k=i nave the same cardinality 
that is equal to G. Using the proper permutation matrix P, these groups can be made contiguous; hence, the 
vector sp := Ps = [sg 1 . . . sg K ] T has covariance matrix with block diagonal structure, that is PS S P T = 
bdiag(Xl SSi , . . . , H SgK ). This implies that the eigenvector matrix U s of PH S P T is block diagonal and 
sparse. Since U s = P r U s and P is a permutation matrix, it follows that U s is also sparse. 
The block-diagonal structure under (a2) emerges when s t corresponds e.g., to a random field in which the 
groups {Gk}k=i correspond to different regions affected by groups of uncorrelated sources. Each of the 
sub-vectors sg k in sp contains sources affecting a certain region in the field and are uncorrelated from the 
other sources present in sp. It is worth stressing that (a2) does not prevent applicability of the ECD-SPCA 
(or BCD-SPCA) algorithm, but it is introduced here only to assess its asymptotic performance. Before 
stating the result proved in Appendix C, let v(J r ) denote the entries of vector v with indices belonging to 
the set T . Further, let <S(v) denote the support of v, i.e., the set of indices of the nonzero entries of v. 
Proposition 1 Let S x = H S + 'S W , with S s satisfying (a2). Further, assume that the spectral radius of 
namely d nmx {H w ), satisfies d max (£l w ) < A(S S ), where A(S S ) > is a function of S s . If {X p = A}^ =1 
are selected such that ||C e p : ||o > % then for any arbitrarily small 5 > there exists a /x such that for 
any fi > fi the minimization in dl9l ) admits an optimal solution satisfying 

l|C^(5 ip )||i < 5, and ||Cj p: (5 ip )||i > £(A P ) > (20) 

||B e , !p (5 i<> )||i < 5, and ||B ei:p (5 ip )||i > £(A P ) > 0, forp = l,...,q (21) 

where Si p is the complement of the support Si p of u s> j p , while {ii, . . . ,i q } C {1, . . . , r}. The constant 
^p(Ap) depends only on X p and is strictly positive for a finite X p . 

Prop. Q] asserts that for n sufficiently large, S-PCA has an optimal solution (B e , C e ) whose support is 
a subset of the true support of {u s j p=1 }* =1 even in the presence of colored noise. This is possible since 
for the /9th row of C e , there is a corresponding i p column of matrix XJ s , r such that ||Cj p .(5j p )||i < 5 for 
arbitrarily small 5, while ||Cj p .(5j p )||i > £p(A p ) > 5 > (strictly positive). Thus, all the nonzero entries 
of C^Tp. with magnitude exceeding 5 will have indices in 5j p := support(u s j p ). This happens since: i) 
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||CjT (5j p )||i can be made arbitrarily small, thus all entries of Cj . with indices in Si p can be driven 
arbitrarily close (J-close) to zero by controlling /z; and ii) ||C^„.(5j p )||i is strictly positive with £, p (\ p ) > S, 
thus some of the entries of Cj p . with indices in Si p must have magnitude greater than S. The number of 
nonzero entries in (Si p ) is determined by X p . Thus, if \ p is selected such that ||C;F „.||o = ||u S) i p ||o, 
then recovery of the whole support Si p is ensured. 

Remark 2: It should be clarified that the vectors {u s : i p } 9 p= i in Prop. [T] may not all correspond to 
the q principal eigenvectors of S s . Nonetheless S-PCA has an edge over standard PCA when colored 
noise corrupts the training data. If the observation noise is white, the eigenspaces of S x and ~E S coincide 
and the standard PCA will return the q principal eigenvectors of S s . However, if w t is colored the q 
principal eigenvectors of ~S X , namely {u X:P } q p=1 , will be different from {u SiP } q p=1 and may not be sparse. 
Actually, in standard PCA (cf. \ p = 0) the magnitude of ||C^ p . (5i p )||i depends on S u , and cannot be 
made arbitrarily small. Thus, the magnitude between the entries of C^ p . with indices in Si p relative to 
those those with indices in Si cannot be controlled, for a given noise covariance matrix. This prevents one 
from discerning zero from nonzero entries in Cj p ., meaning that standard PCA cannot guarantee recovery 
even of a subset of the support of u s> i . 

On the other hand, Prop. Q] states that S-PCA is capable of identifying a subset of (or all) the support 
index set of {u s j p }^ =1 . S-PCA is more resilient to colored noise than standard PCA because it exploits the 
sparsity present in the eigenvectors of S s . Intuitively, the £± regularization terms act as prior information 
facilitating emergence of the U s>r zero entries in C e and B e as long as the noise variance is not high. 
Although A(5] s ) has not been explicitly quantified, the upshot of Prop. [T] is that S-PCA is expected to 
estimate better the columns of U Sjr when compared to standard PCA under comparable noise power. 
Numerical tests will demonstrate that S-PCA achieves a smaller reconstruction MSE even in the presence 
of colored noise. 

B. Oracle Properties 

Turning now attention to the noiseless scenario (E x = ~S S ), S-PCA is expected to perform satisfactorily 
as long as it estimates well the q principal eigenvectors of S s . Reliable estimators of the clairvoyant 
matrices B D = C J = U^„ can be obtained when a growing number of training vectors ensures that: i) the 
probability of identifying the zero entries of the eigenvectors approaches one; and also ii) the estimators 
of the non-zero entries of \J s , q satisfy a weak form of consistency l35l . Scaling rules for the A p 's will 
be derived to ensure that the S-PCA estimates B and C satisfy these so-termed oracle properties. The 
forthcoming results will be established for the BCD-SPCA scheme (of Sec. IIII-BI ). but similar arguments 
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can be used to prove related claims for ECD-SPCA. 

To this end, consider a weighted ^i-norm in (f5]), where the sparsity-controlling coefficient multiplying 
|B(j, p)\ and |C(p,j)|, namely A p>n , is replaced by the product Wj„ n \ Pin . Note the dependence of A Pi „ 
on n, while the io's are set equal to Wj tPin := \XJ Sjn (j, p)|~ 7 , with 7 > and U s denoting the estimate 
of U s obtained via standard PCA. If XJ s (j,p) is zero, then for n sufficiently large the estimate U s „(j, p) 
will have a small magnitude. This means large weight X Ptn Wj :P:n and thus strongly encouraged sparsity in 
the corresponding estimates B(j, p) and C(p,j). The oracle properties for B r n and C r n are stated next 
and proved in Appendix D. 

Proposition 2 Let C r n in ([8]) be an asymptotically normal estimator of C Q = Uj ; that is, 



Cr,n = U^ + V^E^ n (22) 

where the jth column ofYf Tn converges in distribution, as n — > 00, to A/"(0, Se^j), i.e., a zero-mean 
Gaussian with covariance ^E c ,j- If the sparsity-controlling coefficients are chosen so that 

lim = 0, and lim ^ p,rt , = 00 (23) 



then it holds under (al) that ([8]) yields an asymptotically normal estimator ofT5 Q = ~U s ,q>' that is, 

B T , n = U S:q + V^K,n (24) 



where vec{Ei h r n )(S ) — - — > J\f(0,[YjE b ]s o ) (convergence in distribution); [E^Js denotes the submatrix 

' n— loo 

formed by the rows and columns, with indices in S Q := S(\J S q ) = 5(vec(U s q )), of the error covariance 
^E b of vec{Ei h T n ) obtained when B r n is evaluated by standard PCA. It follows that 

lim Pr[5(B r .„) = 5(U S)? )] = 1. (25) 

Following similar arguments as in Prop. [2j it is possible to establish the following corollary. 

Corollary 1 If B T n is an asymptotically normal estimator of B G , and the sparsity-controlling coefficients 
are selected as in Prop. |2] then the optimal solution of © at iteration r + 1, namely C r +i in , is an 
asymptotically normal estimator of C Q and lim n _ !>00 Pr[5(C T+ i >n ) = 5(Uj„)] = 1. 

Prop. |2] and Corollary Q] show that when the BCD-SPCA (or ECD-SPCA) is initialized properly and the 
sparsity-controlling coefficients follow the scaling rule in d23l , then the iterates B r n and C Tn satisfy 
the oracle properties for any iteration index r. This is important since it shows that the sparsity-aware 
estimators B r n and C T]n achieve MSE performance which asymptotically is as accurate as that attained 
by a standard PCA approach for the nonzero entries of U S)9 . This holds since the error covariance matrix 
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of the estimates for the nonzero entries of U Sjg , namely [EeJs,,. coincides with that corresponding to the 
standard PCA. The estimator Bo, n = C^ n = U s ,<j obtained via standard PCA and used to initialize the 
BCD-SPCA and ECD-SPCA, is asymptotically normal 10, ED- 

Remark 3: The scaling laws in (l23l resemble those in 11351 Thm. 2] for a linear regression problem. The 
difference here is that the estimate B r _i „ (or C T)n ) is nonlinearly related with C r „ (B r „ respectively). 
Thus, establishing Prop. [2] requires extra steps to account for the nonlinear interaction between C r _ n and 
B TjTl . In order to show asymptotic normality, the chosen weights Wj^ p . n are not that crucial. Actually, the 
part of the proof in Apdx. D that establishes asymptotic normality is valid also when, e.g., Wj iPin = 1 
and linin^oo rT ' A Pj „ = 0. However, the proposed weights wj^p^n are instrumental when proving that 
the probability of recovering the ground-truth support of U Sj(3 converges to one as the number of training 
data grows large. 

Although the probability of finding the correct support goes to one asymptotically as n — > oo, numerical 
tests indicate that this probability is high even when n and r are finite. This is not the case for the standard 
PCA estimator, namely tJ s ,q,n = q-principal eigenvecs(n _1 J2t=i s * s f )■ Numerical examples will also 
demonstrate that even for a finite number of training data n, the probability of identifying the correct 
support is increasing as the coordinate descent iteration index r increases. Thus, the signal subspace 
estimates obtained by BCD-SPCA (as well as ECD-SPCA) are capable of yielding the correct support of 
U S>(J even when n is sufficiently large but finite. Consequently, improved estimates of U Si9 are obtained 
which explains the lower J rec (B,C) attained by S-PCA relative to PCA. 

V. S-PCA Based transform coding 

Up to this point sparsity has been exploited for DR of data vectors with analog-amplitude entries. 
However, the majority of modern compression systems are digital. This motivates incorporation of sparsity 
also in the quantization module that follows DR. This two-stage process comprises the transform coding 
(TC) approach which has been heavily employed in image compression applications due to its affordable 
computational complexity |[T6l . However, current TC schemes do not exploit the presence of sparsity that 
may be present in the covariance domain. 

A sparsity-aware TC (SATC) is proposed here to complement the BCD-SPCA (or ECD-SPCA) algorithm 
during the data transformation step. The basic idea is to simply quantize the DR vectors using a VQ. Given 
x, the DR matrix C obtained by S-PCA is employed during the transformation step to produce the DR 
vector y = Cx. Then, VQ is employed to produce at the output of the encoder a vector of quantized 
entries f)Q = Q[y] £ Cg Xl , where Cq := {rj 1 , . . . , f) L } is the quantizer codebook with cardinality L = 2 R , 



January 18, 2012 



DRAFT 



IEEE TRANSACTIONS ON SIGNAL PROCESSING (TO APPEAR) 



17 



where R denotes the number of bits used to quantize y. The VQ will be designed numerically using the 
Max-Lloyd algorithm, as detailed in e.g. lfl~3l . which uses {y t = Cx t }™ =1 to determine the quantization 
cells {TZfj }[Lv anc ^ foeii corresponding centroids, a.k.a. codewords {fji}f =1 . 

During decoding, the standard process in typical TC schemes lfT3l . lfl~6l is to multiply t)q = Q[Cx\ 
with the matrix (C T C)^C T , and form the estimate s = (C t C)^C t t)q. This estimate minimizes the 
Euclidean distance ||t)q — Cu|| wrt u. Note that the reconstruction stage of SATC is also used in the DR 
setting considered in Sections [TT] and JIIJ except that rjn is replaced with the vector Cx whose entties 
are analog. The reason behind using only C, and not B, is the penalty term ||B — C T |||. which ensures 
that C and B T will be close in the £2 -error sense. Certainly, B could have been used instead, but such 
a change would not alter noticeably the reconstruction performance. Simulations will demonstrate that 
the sparsity-inducing mechanisms in the DR step assist SATC to achieve improved MSE reconstruction 
performance when compared to related sparsity-agnostic TCs. 

VI. Simulated Tests 

Here the reconstruction performance of ECD-SPCA is studied and compared with the one achieved by 
standard PCA, as well as sparsity-aware alternatives that were modified to fit the dimensionality reduction 
setting. The different approaches are compared both in the noiseless and noisy scenarios. Simulation tests 
are also performed to corroborate the oracle properties established in Sec. IIV-BI The SATC is compared 
with conventional TCs in terms of reconstruction MSE using synthetic data first. Then, SATC is tested in 
an image compression and denoising application using images from HI. 

A. Synthetic Examples 

The reconstruction MSE J rec (B m , C m ) is measured for matrices B m and C m obtained via: i) ECD- 
SPCA; ii) the true signal subspace, i.e., B = = U S(? ; iii) a 'sample' -based PCA approach where 
Ux,g is used; iv) a genie-aided PCA which relies on (iii) but also knows where the zero entries of U Si? are 
located; v) the sparse PCA approach in |[36l abbreviated as ZSPCA; vi) the scheme in |[33l abbreviated as 
SPC; and vii) the algorithm of 13, which is abbreviated as DSPCA. With p = 14 and n = 50, the MSEs 
throughout the section are averaged over 200 Monte Carlo runs using a data set that is different from the 
training set X. In the noiseless constructed to be a permuted block diagonal matrix 

with r = 8, while 80% of the entries of U s r are zero. The sparsity-controlling coefficients multiplying 
|B(j, p)\ and |C(p, j)| are set equal to A Pin |U x (i, j)|~ 7 , with 7 = 1 and A Pjn ~ n 3 . Fig. |3] (left) depicts 
J r ec(-) versus q. The sparsity coefficients in the sparsity-aware approaches are selected from a search grid 
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to achieve the smallest possible reconstruction MSE. Clearly ECD-SPCA exploits the sparsity present in 
S s and achieves a smaller reconstruction MSE that is very close to the genie-aided approach. Note that in 
Fig. [3] (left) there are seven curves. The curve corresponding to sample-based PCA almost overlaps with 
the one corresponding to SPC. It is also observed that the more sparse the eigenvectors in XJ s , r are, the 
more orthogonal are the ECD-SPCA estimates B and C. This suggests that the i\ regularization terms 
in S-PCA induce approximate orthogonality in the corresponding estimates, as long as the underlying 
eigenvectors forming U Sjr are sufficiently sparse. 

Fig. [3] (right) the reconstruction MSE is plotted as a function of the observation SNR, namely SNR b s := 
101og 10 [tr(S s )/tr(S„,)]. The colored noise covariance matrix is factored as E w = M W M^, where M„, 
is randomly generated matrix with Gaussian i.i.d. entries. The ECD-SPCA scheme is compared with the 
sparsity-agnostic standard PCA approach. With r = q = 3 and p = 14, S s is constructed to be a permuted 
block diagonal matrix such that 70% of the entries of the eigenmatrix \J s>r are equal to zero. All sparsity 
coefficients in ECD-SPCA are set equal to A = 5 * 10 -3 . Fig. [3] (right) corroborates that the novel S- 
PCA can lead to better reconstruction/denoising performance than the standard PCA. The MSE gains are 
noticeable in the low-to-medium SNR regime. The sparsity imposing mechanisms of ECD-SPCA lead to 
improved subspace estimates yielding a reconstruction MSE that is close to the one obtained using U s ,q- 
This result corroborates the claims of Prop. Q] 

The next three figures validate the S-PCA properties in the noiseless case (see Sec. IIV-BI) . Consider 
a setting where p = 14, r = 8, and q = 2, and ~E X = S s constructed to be a permuted block diagonal 
matrix such that \J S:T has 70% of its entries equal to zero. The A's are selected as in the first paragraph. 
Fig. 0](left) displays the signal subspace estimation MSE £7[|||C Tjn | — lUjJH^], where the | • | operator 
is applied entry-wise and is used to eliminate any sign ambiguity present in the rows of C r n . As the 
training data size goes to infinity, the estimation error converges to zero. The convergence speed is similar 
to the one achieved by standard PCA. Similar conclusions can be deduced for the reconstruction MSE 
shown in Fig. |4] (right). The reconstruction MSE associated with ECD-SPCA is smaller than the one 
corresponding to standard PCA. The MSE advantage is larger for a small number of training data in 
which case standard PCA has trouble locating the zeros of U Sj(J . These examples corroborate the validity 
of Prop. |2] Interestingly, multiple coordinate descent iterations (r > 0) result in smaller estimation and 
reconstruction MSEs than the one achieved by standard PCA (r = 0). The MSE gains are noticeable for a 
small number of training samples. Such gains are expected since ECD-SPCA (or BCD-SPCA) is capable 
of estimating the true support S(XJ Sjq ) with a positive probability even for a finite n. As shown in Fig. [5] 
(left), for r > the probability of finding the true support converges to one as n — > oo (cf. Prop. 12). As 
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r increases, this probability also increases, while standard PCA (r = 0) never finds the correct support 
with a finite number of training data. 

Next, the reconstruction MSE of the SATC (SecJV} is considered and compared with the one achieved 
by a TC scheme based on standard PCA. The noisy setting used to generate Fig. [3] (right) is considered 
here with p = 14 and n = 22. With r = q = 3, S s is constructed to be a permuted block diagonal 
matrix so that 80% of the entries of U Sir are zero. Data reduction in SATC is performed via ECD-SPCA, 
while its sparsity controlling coefficients are set as described in the first paragraph of this section. Once 
the DR matrix C is obtained, the DR training data CX are used to design the VQ using Max-Lloyd's 
algorithm. Fig. [5] (right) depicts the reconstruction MSE versus the number of bits used to quantize a 
single DR vector. Fig. [5] (right) clearly shows that SATC benefits from the presence of sparsity in S s and 
achieves improved reconstruction performance when compared to the standard TC scheme that relies on 
PCA. The dashed and solid lines correspond to the reconstruction MSE achieved by ECD-SPCA and U s >r 
respectively while no quantization step is present (R = oo). 

B. Image compression and denoising 

SATC is tested here for compressing and reconstructing images. These images have size 180 x 256 
and they are extracted, as described in Sec. Jl] from a bigger image of size 2520 x 2048 in 0]]. The 
images are corrupted with additive zero-mean Gaussian colored noise whose covariance H w is structured 
as "E w = MM 1 ", where M contains Gaussian i.i.d. entries. The trace of H w is scaled to fix the SNR at 
15dB. Out of a total of 112 generated images, 30 are used for training to determine the DR matrix C, 
and design the VQ. The rest are used as test images to evaluate the reconstruction performance of the 
following three schemes: i) the SATC; ii) a TC scheme that uses DCT; and hi) a TC scheme which relies 
on PCA. 

The images are split into blocks of dimension 8x8, and each of the three aforementioned TC schemes 
is applied to each block. Here p = 64 and {x ti }?£ x is a vectorized representation of an 8 x 8 sub-block that 
consists of certain image pixels, while ti G {1, . . . , 112} denotes the image index. During the operational 
mode, datum x corresponds to a noisy sub-block occupying the same row and column indices as the 
x ti 's but belonging to an image that is not in the training set. The signal of interest s corresponds to the 
underlying noiseless block we wish to recover. Each noisy datum x is transformed using either i) the SATC 
transformation matrix obtained via ECD-SPCA; or ii) the DCT; or iii) the PCA matrix. When the DCT 
is applied, then DR is performed by keeping the q largest in magnitude entries of the transformed vector. 
The reduced dimension here is set to q = 14. The q x 1 vectors are further quantized by a VQ designed 
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using the Max-Lloyd algorithm fed with the DR training vectors. At the decoder, the quantized vectors 
are used to reconstruct s by: i) using the scheme of Sec. [V] ii) multiplying the quantized data with t] Xtq 
(PC A); or iii) applying inverse DCT to recover the original block. The sparsity-controlling coefficients in 
ECD-SPCA are all set equal to A = 2 * 1(T 2 . 

Fig. [6] shows: a) the original image; b) its noisy version; c) a reconstruction using DCT; d) a re- 
construction using a PCA-based TC; and e) the reconstruction returned by SATC. The reconstructed 
images are obtained after setting the bit rate to R = 7. The reconstruction returned by SATC is visually 
more pleasing than the one obtained by the DCT- and PCA-based TCs. The figure of merit used next is 
SNR; m := 10logio[^p^], wnere -^signal denotes the power of the noiseless image and P e rror the power of the 
noise present in the reconstructed image. Fig. [7] (left) displays SNR; m versus the bit-rate of the VQ with 
SNR set at 17dB. Clearly, SATC achieves higher SNR im values compared to the DCT- and PCA-based 
TCs. SATC performs better because the ECD-SPCA algorithm used to evaluate the DR matrix C takes 
advantage of the sparsity present in H s . Similar conclusions can be drawn from Fig. [7] (right) depicting 
SNR im versus the SNR for q = 14 and R = 7 bits. 

VII. Concluding Remarks 

The present work dealt with compression, reconstruction, and denoising of signal vectors spanned by 
an orthogonal set of sparse basis vectors that further result a covariance matrix with sparse eigenvectors. 
Based on noisy training data, a sparsity-aware DR scheme was developed using i?i-norm regularization to 
form improved estimates of the signal subspace leading to improved reconstruction performance. Efficient 
coordinate descent algorithms were developed to minimize the associated non-convex cost. The proposed 
schemes were guaranteed to converge at least to a stationary point of the cost. 

Interesting analytical properties were established for the novel signal subspace estimator showing that 
even when the noise covariance matrix is unknown, a sufficiently large signal-to-noise ratio ensures that 
the proposed estimators identify (at least a subset) of the unknown support of the signal covariance 
eigenvectors. These results advocate that sparsity-aware compression performs well especially when a 
limited number of training data is available. Asymptotic normality is also established for the sparsity- 
aware subspace estimators, while it is shown that the probability of these estimates identifying the true 
signal subspace support approaches one as the number of training data grows large. Appropriate scaling 
laws for the sparsity-controlling coefficients were derived to satisfy the aforementioned properties. 

Finally, the novel S-PCA approach was combined with vector quantization to form a sparsity-aware trans- 
form codec (SATC) that was demonstrated to outperform existing sparsity-agnostic approaches. Simulations 
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using both synthetic data and images corroborated the analytical findings and validated the effectiveness 
of the proposed schemes. Work is underway to extend the proposed framework to settings involving 
compression of nonstationary signals, and processes with memory. 
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Appendix 

A. Proof of Lemma [7} The minimization problem in (fT2l can be equivalently expressed as 

c = argmin \\x — ch|| 2 + p(c — b) 2 + Xt, s. to —t<c<t (26) 

c 

and the derivative of its Lagrangian function involving multipliers v\ and V2 is given by 

V c £(c, u\, v<2) = 2ch T h — 2ch T x + 2c/i — 2pb + v% — U2- 

After using the KKT necessary optimality conditions EJ pg. 316], it can be readily deduced that the 
optimal solution of ([T2l is given by the second equation in Lemma Q] □ 

B. Proof of convergence of ECD-SPCA 

Let f({B(j, p), C(p, j)}?'l ljP=1 ) denote the S-PCA cost given in ©, defined over M 2p<?xl ; and / ({B(j, p), 
c (pJ)Yj=i,p=i) n -1 ||X - BCX||| + p\\B - C T \\ 2 F . Next, consider the level set 

T° := {{B(j»,C(p,i)}^ liP=1 : f({B(j, p), C(p, j)} p /l ljP=1 ) < /(B ,C )} (27) 

where Bo = = U Si g coiTespond to the matrices used to initialize ECD-SPCA obtained via standard 
PCA. If Bo and Co have finite £i-norms, the set T° is closed and bounded (compact). The latter property 
can be deduced from © and (|27T ). which ensure that matrices B and C in J 70 satisfy ^ • A p (|B(j, p) \ + 
|C(p, j)| < /(Bo, Co). Moreover, /(Bo, Co) is finite when Bo and Co have finite norms. This is true 
when the training data X contain finite entries. Thus, T is a compact set. Further, the cost function /(■) 
is continuous on F°. 

From (ITTb and (IT4T > it follows readily that the minimization problems solved to obtain C r ( j, p) and 
B T (p,j), respectively, are strictly convex. Thus, minimizing /(•) with respect to an entry of C or B 
yields a unique minimizer, namely C T (p,j), or B r (j, p). Finally, /(•) satisfies the regularization conditions 
outlined in 11301 (Al)]. Specifically, the domain of /o(-) is formed by matrices whose entries satisfy 
B(j, p) G (— oo, +oo) and C(p,j) G (— oo, +oo) for j = 1, . . . ,p and p = 1, . . . , q. Thus, domain(/o) = 
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(— oo,oo) 2(?p is an open set. Moreover, /o(-) is Gauteaux differentiable over domain(/o). Specifically, the 
Gauteaux derivative of /o(-) is denned as 

/i(B, C; As, Ac) := lim /»(B + C + ft A c ) - /„(B, C) 

Applying simple algebraic manipulations it follows readily that the Gauteaux derivative exists for all 
A Bl A c e domain(/ ), and is equal to 2/itr[(B - C r )(A B - A^) T ] - 2tr[(X - BCX)(A B CX + 
BAfX) 1 ]. Then, convergence of the ECD-SPCA iterates to a stationary point of the S-PCA cost /(•) is 
readily established using 11301 Thm. 4.1 (c)]. □ 

C. Proof of Proposition^} It can be shown by contradiction that for every e > there exists a p e such that 
for any p > p e it holds that ||B e — Cj||i < e/2. Given that H x = \J X Y) X \J X , and since ||B e — Cj||i < e/2 

for \i > [i e , the minimization problem in (fT9l can be equivalently rewritten as 

<? 

C e (e) G arg min tr (dV 2 U^(I 9 - 2C T C + C T CC T C)U a; Dy2j +^2A|| Cj||i + (f>(C,e,fi) (28) 

where <^>(C, e, /i) is a continuous function of C and e, while (p(C, 0, /u) = 0. 

Let us now consider how the support of each of the rows of C e is related to the support of the principal 
eigenvectors {u SjP } r p=1 . To this end, remove </>(C,e, //) from (1281 and consider the minimization problem 

<? 

C e G argmintr[(I- C T C)S X (I- C T C) T ] + ^2A||C^||i. (29) 
° p=i 
Since the cost in (|28l is continuous, one recognizes after applying a continuity argument fTT] pg. 15], 

that for any 5 > a sufficiently large ^,5 can be found such that for any [i > max(/ij,/j e ) there exists an 

optimal solution B e , C e in (1281 ). as well as an optimal solution C e in d29l such that ||C e — C e ||i < <^/2 

and ||B e - CjT||i < ||B e - CjT||i + ||C e - C e ||i < (5 (details are omitted due to space limitations). As the 

optimal solutions of d28l and d29l can be arbitrarily close, one considers the simpler of the two in ( |29l . 

Given that S x = S s + S„ = U Sir D Sjr Uf r + U m D„U^, the minimization in d29l can be rewritten as 

g 

CeGargmm tr (c(£ S)P + £„,, P )C T (CC T - 2I ? )J + 2Aj] ||Cj||i (30) 
c p=i 

where C = CP T , X) Sj p := PS S P T , E^p := PS„,P T while P is a permutation matrix constructed 
so that S s> p is block diagonal, and C e = C e P T denotes one of the optimal solutions of (l30l . Since the 
£i-norm is permutation invariant, it holds that ||C^||i = ||C^||i. 
The minimization problem in (l30l) can be equivalently written as 

q p 

C e G argmin tr (cS XiP C T (CC T - 21,) J + 2A J] J] T(p, j), s. to |C(/>,j)| < T(p,j). (31) 

c p=i i=i 
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Let T denote the q x p matrix whose (p, j)fh entry is equal to T(p,j). Then, the Lagrangian of (f3TT > is 

£(C,T,Li,L 2 ) =tr(CS x ,pC T (CC T -2I 9 )) + 2Al^ xl Tl pxl +tr(Lf(C-T))+tr(L^(-C-T)), (32) 

where Li,L2 £ M' ?xp and their (p, j)th entry containts the Lagrange multiplier associated with the 
constraints C(p,j) < T(p,j) and — C(p,j) < T(p,j), respectively. The first-order optimality conditions 
imply that the gradient of £(•) wrt C should be equal to zero when evaluated at C e , i.e., 

2C e C e C e S x> p + 2C e ^x,pC e C e — 4C e 'S X) p + Li — L2 = gX p- (33) 

Similarly, the gradient of £(•) wrt T should be equal to zero at the optimum solution, T*, which leads to 

Ll + L5 = 2Al, x ilJ xl . (34) 

Moreover, the optimal multipliers should be nonnegative, i.e., L*(p, j) > and L?j(p, j) > for j,p = 
l,...,q, while the complementary slackness conditions give that L*(p, j)(C e (p,j) — T*(p,j)) = and 
L*(p,i)(-C e (p,i) - T*(p, j)) = (see e.g., pg- 316]). 

Let e p S R (?xl denote the canonical vector which has a single nonzero entry equal to one at the p-th 
position. After multiplying the left hand side (lhs) of (|33l from the left with ej. and from the right with 
C e>p: we obtain 

2C;Tp.CjrC e S X) .pC e) p; + 2C^p.Xl a:j pC^C e C ej p : — 4C^ p .'£ x ^pC e p- (35) 

p 

+ ^(L!(p,j) - L 2 (p,i))C e (p, j) = 0. 
3=1 

Note that the last summand in d35l ) is equal to 2A||C p: ||i. This follows from the aforementioned slackness 
conditions. Specifically, if C e (p, j) > then C e (p, j) = T*(p, j) > 0, which further implies that 
L2 (p,j) = and from (O it follows that ~L\(p,j) = 2A. In the same way if C e (p,j) < 0, then 
C e (p,j) = — T*(p, j) < from which it follows that L^(p, j) = 0, thus from (l34l we conclude that 
L?;(p, j) = 2A. Thus, (L^(p, j) — L^p, j))C e (p, j) = 2A|C e (p, j)\, and after some algebraic manipulations 
on (|35T ) it follows 

Ce\p: S a;,p(Ipxp - C e )C ei p : = 0.5A||C e ,p : ||l, p=l,...,q. (36) 

Summing the g different equalities in (l36l ) we obtain 

<? 

tr(S a ,p(I pX p-CrCe)C^C e ) = 0.5A^||C eiP: || 1 (37) 

P =i 

Equality (l37l ) can be used to reformulate the cost in (l30l ) without affecting the optimal solution. 
Specifically, the cost in (US can be rewritten as te(C£ Xi pC T CC T )-2te(C£ x>P C T ) + 2A2^p=i ||Cj.||i = 
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— trfCS^pC ) + L5A^^ =1 ||Cj.||i. Using the latter cost expression and expanding the lhs of the q 
different equality constraints in (l36l ) the minimization problem in (l30l ). is equivalent to 

Q 1 

C e E argmin-^CjjS^pCp. + 1.5A^ ||Cj\||i (38) 
P =i P =i 
i 

s.to (l-||C p: ||!)c£E a ,pC p: - (pj£p) (c^S x , P C i: ) =0.5A||C p: ||i, p = 1, . . . , g. 

Each one of the summands of the sum in the lhs of the equality constaints in (l38l ) can be rewritten as 

Cj.C p: Cl^ x , P C r . = C T p , (v,.. r C r C' h ) C p: >0J,p = l,...,q and p ± j. (39) 

Notice that the quantity in d39l is nonnegative since rank(S x t pCj : Cj.) = 1, while the single nonzero 
eigenvalue of £ x pC j: C£ is d max (£ x pC j: Cp = tt(£ x pC i: Cp = CjX.,.. r C j: > for C i: / 0. 
From the constraints in ( f38T > and d39l ), it follows that ||C p: ||2 < 1, otherwise ||C p: ||i would be negative 
resulting a contradiction. 

For the time being let us ignore the noise covariance matrix E^p by setting it to zero, thus E x p = E S) p. 
For the selected sparsity-controlling coefficient A in (l30l assume that the optimal solution has ||C^„. ||o = lp 
and ||C^ p .||i = k p , while 2 < l p < G for p = 1, . . . , q. This is possible since the ^i-norm is used in 
S-PCA. The case q = 1 is considered first to demonstrate the main result which is then generalized for 
q > 1. Toward this end, let Ci : = ||Ci : ||| • Uc, : i, where ||uc,:i II2 = 1- Moreover, to simplify notation 
let ci = ||Ci : |||, and 71 = ujT.jEs pug,;].; thus C^E Sj pCi : = cfji > 0. Let d\ denote the maximum 
spectral radius among all possible l\ x l\ submatrices of E s p that are formed after keeping l\ of its 
rows and columns with common indices that are determined by the indices of the l\ nonzero entries 
in the optimal C e ,i: = ||C e) i : |||ug )ej: i, and ug )e ,:i the optimal selection for Ua, : i. Then, it holds that 
71 = uJ.jS^pUj.;! < d* for any unit-vector ug ; :i for which ||uc, : i||o — h- With this notation in mind, 
q = 1 and ||Cj 1: ||i = K\ (|38T ) is equivalent to 

min— c?7x, s. to (1 — cf)c?7i = 0.5Aki, < c\ < k[ = min(l,Ki), < 71 < d?, (40) 

ci,Yi 

where the first inequality constraint in (l40l follows from the fact that c\ < 1 and ci = 1 1 CZ^ 1 : 1 1 2 < fti = 
||Ci : ||i. The Lagrangian of d40l ) is given as 

A(ci )7 i,v) = -c? 7 i + - c?)c? 7 i - 0.5A«i] +«J[ci - «i] - ^ c ci - ^71+^1(71 - dj), (41) 

where v := [uj wf] T contains the Lagrange multipliers. After i) differentiating (I4TT ) with respect 

to ci and 71; ii) setting the corresponding derivatives equal to zero; and iii) applying the complementary 
slackness conditions [see also Karush-Kuhn-Tucker necessary optimality condition in [3] pg. 316]] it 
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follows that the optimal value v\'* of the multiplier v\ should be strictly positive. The slackness conditions 
imply that v e {*(^\ -d\) = 0, then it follows that at the minimum of (00]) it holds that 7^ = d\. Now recall 
that 71 = uJ.jS^pUj^i, thus 7^ is formed when u^i = ug^a (the optimal direction toward which the 
optimal row C eiP: is pointing). 

Recall that d* = max u _ , x u? .jS s pUj^i subject to ||u.c, : i H2 = 1 an d ll u c,:i||o = h- Next, we demonstrate 
that if u~ e .[S Si pU5 iei :i = d*, then there exists a column, say the i\th in V s>r , with support such that 
||ujF e .i(£ii)||i = 0, while Hu^g ^(cSjJIli > and Si 1 denotes the complement of S^. Since Cj 1 . is a 
scaled version of u~ e . v the latter property will further imply that || 0^.(5^)111 = 0, while HC^.^JHi > 
£i(A), where £i(A) is strictly positive. Equivalently, we will show that X\ := <S(C e .i:) = 5(ug jei: i) C Q kl , 
where Q kl = corresponds to the index set of the entries of sp = Ps that belong to, say the k\\h 
diagonal block of £ S) p and k\ G {1, • • • ,K}. To this end, let u| :1 = [(ui :1 ) T , (u? :1 ) T , . . . , (u|" :1 ) T ], 
where each subvector Ug. x has G entries; and let X\^ := 5(u^ a ) with Y^k=i \^-i,k\ = h- Then, it follows 
that 

K K 
ui; :1 S SiP U g , : i = ^( U | :1 ) T S SCfc u| :1 < ^dmax^jKalU, (42) 
fc=l fe=l 

where d max (Sg s ) denotes the spectral radius of the l\ x l\ submatrix E^, formed by the G xG diagonal 
block S SSfc of X X) p after keeping Zi of its rows and columns with common indices. The inequality in (l42l 
follows since each subvector u-^ of ug i: i can have at most l\ nonzero entries. If d 1 ^ denotes the maximum 
spectral radius that can be achieved by any l\ x l\ submatrix S^ 1 that is contained in a diagonal block S Sgfe 
of E Sj p for k = 1, . . . ,K, then from (02]) and since Ylf=l ll u 5 1II2 = 1 it holds that u~ :1 £ Si pUg ;: i < d l {. 
Thus, it should hold that d 1 ^ = d*. Then, the max value d\ can be attained if and only if the indices 
of the nonzero entries of ug je: i satisfy X\ C Q kl for a k± € {1, . . . , This further implies that there 
exists an eigenvector u,^, := Pu S) , 1 with support = Qk t , for which X\ C 5^. Thus, it is deduced that 
u Ie 1(^*1) = and ||ujT e .-^SjJHi > £i(A) > since the l\ nonzero entries of u~ e :1 have indices in 5^. 
Positivity of £^(A) is ensured since Hujg.-J^ = 1 and A is selected such that C e .i: = {c\) 2 Uc. e ,:p 7^ 0. 
Since C e = C e P in ( f29T > results from permuting the columns of C e , it follows that HCjf^^j )||i = 
and llC^^JHi > e'i(A) > 0, where S ip = S(u s>ip ). 

We generalize the previous claim for the case when q > 1. As before we reexpress each of the rows 
of C as C p: = || Cp; Hfu-gjipj with ||ug i:(0 ||2 = 1 for p = 1, . . . ,q. No other assumptions are imposed 
for the direction vectors Ug^. Further, let c p = ||C p: ||2 and 7 P = u~. S S) pUg j:/ „ where p = l,...,q. 
Moreover, let 8j iP = (u^. •ug^^u-'. -E^pu^) for j ^ p and further notice that 8j iP = 8 p j > [cf. 
d39l)l, while 5j :P < j p and < jj. Also recall that A has been selected such that {||C e ,p:||i = K p} q p =i 
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and {||C ej p : ||o = ^p} p= i where 2 < l p < G. Then, the minimization problem in (I38T ) can be equivalently 

rewritten as 

g g 

min - ^ c 2 p Jp, s - to (1 ~ c p ) c p7p - ^ c 2 p c 2 5 jjP = 0.5Ak p , , < c p < K p := min(l, k p ) 
p =1 i=^d+p 

0<7 P <d* p , ,0<S j:P <jp, 5,- jP <7j, S j:P = 6 pd , j ^ p, j,p = 1,. . . ,q (43) 

where d* corresponds to the maximum value that u~ 'S x P Uc jP : can attain when ||C p: ||o = l p , while Uc iP; 

are selected such that the constraints in d43t are satisfied. The Lagrangian function of d43l is given as 

q g g 

£ 2 ({c p ,7p}^ 1 ,{^ P },v) = -^ C 2 7p + ^^[(l- c 2) c 2 7p _ c 2 p cp jtP -0.5\K p ] (44) 

p =1 p =1 j=1jVp 

q 

q q 

where v is a vector that contains the Lagrange multipliers v p ,v p ,Vp,v p ,v p ,Vj , u| , and uj . The 
KKT conditions are applied next to derive necessary conditions that the optimal solution of (l43l should 
satisfy. This involves i) differentiating (1441) wrt c p , 7 P and Sj p ; ii) setting the corresponding derivatives 
equal to zero; and iii) applying the complementary slackness conditions for the optimal multipliers v*. 
Then, it follows that at the minimum of (l43l it should hold that 5* p = 0, and 7* = d* for j, p = 1, . . . , q 
and j ytz p. From the definition of 5j tP it follows that 5* is formed using the optimal vectors us, e ,p:, 
i.e., 5* p = (uJ e .jUc } e,: P )(u r ~ e . : j'E St pUc t e,-.p)- Since 5j p = 0, it follows that the optimal direction vector 
u c,e,-.p should be selected in (I38T ) such that u~ e .jU^ e ,:p = 0, or u? e .^5] Sj pUc,e,:p = for j / p, while 
7* = uj e S Si pUg,e,:p iis equal to the maximum possible value d*. Since C e , p: = (c* ) 2 u^ e ,: P it follows that 
the rows of the optimal matrix C e should be selected such that either they are orthogonal CjT p: C e j : = 0, 
or C^ p .£ Sj pC e j : = 0. In summary the direction vector for the pth row of the optimal matrix C e in (l38l ). 
namely u^ e - p , should be selected such that 

Uc,e,:p = argmaXu| :/ ,S Si pUc, :p , S. tO (u7 J .Us j e ): p)(u? J -S S) pUa,:p) = 0, ||Uc, :p || 2 = 1, ||Uc, e ,:p)||l = lp, 

where p = l,...,q, j 7^ p. Using similar reasoning as in the case where q = 1 it follows that for 
every optimal row C e , p: there exists i p E {l,...,r} such that Z" p = S(C^ p .) C <S(u Sjip ) Letting 5j p 
be the complement of <Sj p , it is deduced that Uj e . p (5j p ) = and ||u;T (<Sj p )||i > £ p (A) > since 
the l p nonzero entries of uj e have indices in Si p . Positivity of £ p (A) is ensured since \\u~ e || 2 = 1 
and C e , p: = (c* p ) 2 \ic fi ,-.p 7^ for the selected A. Then, it follows readily that ||C e , P :(<Si„)||i = while 
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||C eiP: (5j p )||i > £ p (A) > for p = 1, . . . , q. Since C e = C e P in d29l ) results from permuting the columns 
of C e it follows that ||C^ : (5 ip )||i = and ||C^ p: (^ p )||i > ? p (\) > 0, where S lp = <S(u s , ip ). 

The latter property was proved under the assumption that S^p = 0. Consider now the general case 
where J} w ,p 7^ 0, thus S^p / S Sj p. An upper bound on the noise variance will be determined that 
ensures the validity of the earlier claims about CjT p . (or C^T p .) established in the noiseless case. Let 
Uc 5 : P be a direction vector that results a row vector C p: that belongs to the constraint set of (I38T ). while 
||uc, : p||o = ||Cp : ||o = l p - Further, assume that the support of u c - j:p is different from the support of the 
optimal uj e evaluated when ~E Wi p = 0. One sufficient condition to ensure optimality of {uj e . p } p= i in 
the presence of noise is that 

u^ :p S Si pUc i:p + u|. pE Wt pUc,: P <d*= u-^.pSs^pUc^.p, p = l,...,q. (45) 

for any iic.-p that results a feasible C p; in d38l ), while || ug,:p ||o = lp an d <5(uc, : p) 7^ 5(uc. e :p ). Given that 
uJ. p ^ Wt pu^ : p < d max (£ W: p), it follows that d45l ) will be satisfied when 

4ia X (£«,,p) < d* - u| :p S Sj pUa >:/9 . (46) 

Note that u^. p S Sj pUc >;p < d* since Uc >;p does not have the same support as u~ e that maximizes the 
problem at the bottom of pg. 28, in which S W) p = 0. Thus, the quantity in the right hand side of d46l) . 
denoted as A(S S ), will be positive. 

What remains to establish are the properties stated in Prop. 1 for CjT p . and B e :p with p = 1, . . . , q. To 
this end, recall that for any p > m&yi(ps, p e ), and for each C e , or equivalently C e , there exists an optimal 
solution C e and B e of (O for which ||C e - C e ||i < 5/2 and ||B e - Cf ||i < 5, where C e = C e P. 
Then, ||C^ p . — C e ||i < 5/2 for p = Then, it readily follows that ||C^ p .(5j p )||i < 5/2 since 

Cl P -{S lp ) = T . Moreover, ||C^(5*J|| - 5/2 < \\Cl p: (S ip )\\i < ||C^ : (^)|| + 5/2. Notice that the 
lower bound ||Cj p .(5j p )|| — 5/2 > £ p (A) — 5/2 can be made strictly positive by pushing 5/2 arbitrarily 
close to zero, which is possible by increasing p. However, £ P (A) remains strictly positive for the values 
of A p considered here, since it does not depend on p. These properties can also be established for B e ;p 
using similar arguments. □ 

D. Proof of Proposition In the noiseless case the training matrix X n = S n (note the dependence on n) 
can be written as S n = U Sj9 U^ g S n + U SiP _ g U^„_„S n . For notational convenience let T q>n = Uj^S n , 
and z q ^n = vec(U s>p „g U^ p _ g S n ) = ((U^ p _ g S n ) T <g> I p ) vec(U SiP _ g ). Using vec notation, it holds that 
vec(S n ) = (r^ n (g>Ip) \ec(U Si q) +z 9) „. Moreover, let b = vec(B) = vec(U Stq ) + V n _1 b, where \/n _1 b 
quantifies the estimation error present when estimating XJ S >q via ([8]). Using this notation and after applying 
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some algebraic manipulations the cost in ([8]) can be reformulated as 



q pp 

Jfc(b) := ||(rj n ® I p )vec(U s , g ) + z 9in - ((C T ,nS„) T ® I p ) b||| + ^ A p , n ^ |b(j p ) 

p =1 j P =p(p- i)+i 



+ /x||b - c* ||| 



b + z gi „ - Vn 31 [(E^ n S n ) T ® I p ] vec(U S) q)||l+ 



/'/» 



5] A P ,n J] IV^bO'p) + vec(U Si9 )(j p )| + nWV^b - V^Fie^l 
p =1 i P =(p-i)p+i 



(47) 



where the second inequality follows after replacing b with vec(U s>(? ) + \/re _1 b in all three terms in 
the expression following J&(b). Moreover, c^. n := vec(C^ n ), e£ n := vec((E£ n ) T ), and vec(U Sjg )(j) 
denotes the jth element of vec(U S)(? ). Recall that the optimal solution of © is B r n , and let b n := 
y / n[vec(B r n ) — vec(U Sj9 )]. We will show that the error h n which minimizes (|47"T ) and corresponds to the 
estimate B r n converges to a Gaussian random variable, thus establishing the first result in Prop. [2] 
To this end, consider the cost J&(b) — Jfi(O) which has the same optimal solution as J(,(b), since J&(0) 
is a constant. After performing some algebraic manipulations we can readily obtain 



2n 



J b (h) - Jfe(O) =T»- 1 b J [(C T)n S n S^C^„ + (Ulg) ® I pxp 

- 2 Ai n" 1 b T e; n + 2n" 1 b T [(C T , n S n S£(E^ ) T 



"°' 5 b T 

® Ir 



C r)n S n S n U Sj p_g) <8) Ip vec(U Sj p_q) 
vec(U s>(? ) 



p =1 jp=(p-l)p+l 



\/n vec(U s , g )(i p ) + \ / rF T b(j p ) - |vec(U S|? )(j 



(48) 



Next, it is proved that J&(b) — J&(0) converges in distribution to a cost G&(b), whose minimum will 
turn out to be the limiting point at which b n converges in distribution as n — > oo. It follows from (al) 
that ~S s ,n = ^ _1 S n S„ converges almost surely (a.s.) to S s as n — > oo, whereas C Tjn converges in 
distribution to U^„ (this follows from the asymptotic normality assumption). Then, Slutsky's theorem, 
e.g., see |[T4l . implies that the first term in d48l converges in distribution to b T (D s >9 (g) I p )b. Recalling 
that the estimation error e£ n is assumed to converge to a zero-mean Gaussian distribution with finite 
covariance, the third term converges in distribution (and in probability) to 0. Taking into account (al) 
and that S n S^ = nS s n = nS s + E s n , where n _1 E s n corresponds to the covariance estimation error, it 
follows readily that the second term in (148T ) is equal to 

"°' 5 b T 



2n 



C r , n SS T U s , p _ g ) ® Ip vec(U s ,p_ g ) = 2n-°' 5 b T [(Uj ? E s , n U S)P _ 3 ) ® I p ] vec(U SjP _ 



+ 2b T [(E« n E s U a 



Ip] vec(U s ,p_g) + 2n x b T [(E£ E Sjn U 8)P _ ff ) <g> Ip] vec(U s , p _ 



(49) 



Recall that E T „ converges in distribution to a zero-mean Gaussian random variables with finite covariance, 
whereas E s n adheres to a Wishart distribution with scaling matrix S s and n degrees of freedom |[T9l pg. 
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47]. Then, it follows readily that the first and third terms in (1501 ) converge to zero in distribution (thus in 
probability too). Then, we have that the lhs of (l49l ) converges to 

2b [(E c E s U s , p _ 3 ) ® I p ] vec(U s , p _ (? ) = 2b T vec(U s , p _ g U^_ (? 5] s E^) (50) 

where E c denotes the Gaussian random matrix at which E£ n converges in distribution as n — > oo. 
Similarly, we can show that the fourth summand in d48l ) converges in distribution, as n — > oo, to 

2b T [(Uj,S s E^) ® I P ] vec(U Si9 ) = 2b T vec(U Si9 E c E s U s , g ). (51) 

The limiting noise terms in (l50l and d5TT > are zero-mean and uncorrelated. Now, we examine the limiting 



behavior of the double sum in (|48T ). If vec(U Sj(J )(j) / then lim n ^oo \fn [\vec(U s , q )(j) +Vn~ 1 b(j)| — 
|vec(U S) g)(j')|] = sgn[vec(U S)? )(j)]b(j). Since io J)(0) „ converges in probability to |vec(U s>g ) (j)|~ 7 , we 
can deduce that if A Pin is selected as suggested by the first limit in (|23"T ). then the corresponding term in 
the double sum in (148T ) goes to zero in distribution (and in probability) as n — > oo. 

For the case where vec(U S)? )(j) = 0, it holds that y/n |vec(U s>(? )(j) + V n _1 b(j)| — |vec(U S)(? )(j)| 
= |b(j)|, and also \fn\ Ptn Wj^ n = y / nAp )n n 7 / 2 (v / n|vec(U Sj q(j))|)~ 7 . Since U Si9 is an asymptotically 
normal estimator for U Sj(? it follows that (- v /n|vec(U s>(J (j))|)~ 7 converges in distribution to a random 
variable of finite variance as n — > oo. Given that A Pi „ satisfies the second limit in d23l) . using the previous 
two limits and Slutsky's theorem we have that the quantity in (l48l ) converges in distribution to 

G b (b) = h^ o \D Biq ® I p ]s b So - 2b^ o [vec(U a ^ g Ujp_ 9 E a E^ - U s , g E c S s U s , g )] ^ (52) 

if b(j) = for all j S ; otherwise, the limit is oo. The notation [M]s o in (1521 denotes the submatrix 
of M whose row and column indices are in <S(U Si9 ). The optimal solution of d52l ) b is given by 

b(S a ) = ([T> s ,q ® Ip]^)" 1 [vec(U S)P _ 3 Uj p _ ? E s E^ - U s ,,E c S s U s , 9 )] So , and b(<S ) = 0. (53) 

Since the cost in d8]) is strictly convex wrt B and J&(b) — J&(0) A- G&(b) as n — )• oo, one can readily 
apply the epi-convergence results in ll22l to establish that b n —± b as n — > oo, while b corresponds to a 
zero-mean Gaussian random vector. This establishes asymptotic normality of B T n . An interesting thing 
to notice is that when setting in (H) \x = and {\ P , n = 0}p =1 (standard PCA approach) it follows that the 
corresponding cost in (l48l) converges in distribution to the one in (1521) with S Q = {1, . . . ,p}. This result 
establishes that the covariance of b is equal to [E^J^, where E^ b is the limiting covariance matrix of 
the estimation error when the standard PCA approach is employed. 

Next, we prove that the probability of finding the correct support converges to unity as n — > oo. Letting 
<Sb,u '■= <5(B T)n ), we have to show that: i) Pr({z G <Sb,ti}) "~ > °°> 1 Vz G S Q ; and ii) Pr({i G SB,n}) n ^°°> 
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Vi ^ S Q . The asymptotic normality of B r n implies that B r , n A U S](? . Since U S (? is a constant matrix, 
it also holds that B r , n — > q and the first part of the proof is established. Concerning the second part, 
differentiate the cost in (1471 ) wrt b, and apply the first-order optimality conditions to obtain an equality 
whose lhs and right hs (rhs) are normalized with n. It then holds V j G {1, . . . ,p} and p G {1, . . . , q} for 
which (p — l)p + j G Ss, n that 



-■ T r 

;C r , n S) ® I p ((Cr,nS) T <8> Ip) [vec(U M ) - b T , n ] + z ? , n - V^[(E c<n S) T ® I p ]vec(U 



+ 7= = sga[b T J(p- l)p + j)\ H ' (54) 



where [M]j : denotes the jth row of matrix M and b Tn = vec(B Tn ). The rhs in (l54l can be rewritten 
as ^--—5^-^ — — and goes to oo when lA „}^_i are selected according to d23l) . The second fraction 
at the lhs of d54l converges to in probability. Next, we show why the first and third terms in the first 
fraction converge in distribution to a zero-mean Gaussian variable with finite variance. For the first term 
this is true because: i) C r ,n A U^ 9 ; ii) n _1 SS T S s ; and hi) - v /n[vec(U Sjg ) — b Tiri ] converges in 
distribution to a zero-mean Gaussian random variable as shown earlier. This is the case for the third term 
too since: i) n~ 1 SS T S s ; ii) C r ,n ^ U^ g ; and hi) E c n converges in distribution to a zero-mean 
Gaussian random matrix. Finally, the second term in the first fraction converges in probability to a constant 
because n _1 SS T S s and C r ,n ^> U^ ? . 

Notice that the event i = (p — l)p + j£ Sb^ implies equality (1541 : hence, Vi ^ S Q Pr({i G 5B, n }) < 
Pr({eq. (1541 is true}). However, as n — > oo the probability of (l54l being satisfied goes to zero since the 
lhs converges to a Gaussian variable and the rhs goes to oo; thus, Vi ^ S Q it holds that lim n ^oo Pr({i G 

<W) =0. □ 
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Fig. 3. Reconstruction MSE vs. q in the noiseless case (left); and observation SNR in the noisy scenario with q = 3 (right). 



January 18, 2012 



DRAFT 



IEEE TRANSACTIONS ON SIGNAL PROCESSING (TO APPEAR) 



34 




Fig. 4. Normalized estimation MSE (gp) _1 S[|| |C T , n | - |U a ,, 1 (left); and reconstruction MSE J rec (right) vs. n for ECD-SPCA 
with p — 14 and q = 2. 
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Original image Noisy image DCT-based rec. PCA-based rec SATC rec. 




SNR im = 12dB SNR„„ = 13.6dB SNR 4m = 14.2dB SNR im = 15.89dB 

Fig. 6. Original image (leftmost); its noisy version; and its reconstruction using DCT-based TC, PCA-based TC and SATC 
(right). Reconstructed images are produced after setting q = 14 and R = 7 bits for each of the 8x8 blocks comprising the 
original noisy image. 




1 2 3 4 5 6 7 8 9 2 4 6 8 10 12 14 16 

Bit rate R Observation SNR (dB) 

Fig. 7. SNRi m vs. bit rate R (left); and observation SNR (right) for different TCs using the images extracted from l"fl . 
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